Project Description

The aim of this project is to extract the titles and content of articles from a news website in order to find the words most commonly used in the news. The website is https://www.foreigner.fi. We extract only National news from this website (https://www.foreigner.fi/blog/section/national). For this project, we have taken the first 30 pages of that section.

This is a group project by Pawan Singh, Shree Sapkota and Swostik Shrestha.

Importing libraries

Data collection using requests

Here, 'req' is the response object, which holds all the information returned by the request.

Now we will find all the links on the page at the given URL.
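A minimal sketch of this step, using Python's built-in html.parser as a stand-in for whichever HTML parser the project used. The class name LinkCollector and the sample HTML fragment are hypothetical; in the project the text to parse would come from the response object.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# ASSUMPTION: a hypothetical HTML fragment standing in for the downloaded page.
sample_html = """
<div class="news">
  <a href="/news/first-story">First story</a>
  <a href="/news/second-story">Second story</a>
  <a>Anchor without a link</a>
</div>
"""

collector = LinkCollector()
collector.feed(sample_html)
link = collector.links  # relative addresses, not yet complete URLs
print(link)
```

Anchors without an href attribute are skipped, so the resulting list contains only usable addresses.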

If we check the first element in our list, 'link', the address is still not complete. We need to add 'https://www.foreigner.fi' to the front of every link. Let's do it as follows.
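One way to complete the addresses is sketched below with urljoin from the standard library; plain string concatenation would work as well. The relative paths are hypothetical examples, not links taken from the site.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.foreigner.fi"

# ASSUMPTION: hypothetical relative links of the kind found on the page.
link = ["/news/first-story", "/news/second-story"]

# urljoin resolves each relative path against the base address.
full_links = [urljoin(BASE_URL, href) for href in link]
print(full_links[0])
```

urljoin also leaves already complete URLs unchanged, which helps if a page mixes relative and absolute links.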

Now we can check whether we get a successful response when we send a GET request to these links, as follows.

So, the first link returns a successful response (200).

Finding all links from different pages

We will use 30 pages (each page has 20 news links) to collect all the links (20*30 = 600 links) and then extract data from each link below.

There are 20 news headlines (20 links) on each page. So, across 30 pages, we have 30*20 = 600 news headlines (links).
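The page loop and the arithmetic above can be sketched as follows. The pagination pattern used here (a page number appended to the section URL) is purely an assumption for illustration; the real site may address its pages differently.

```python
SECTION_URL = "https://www.foreigner.fi/blog/section/national"
PAGES = 30
LINKS_PER_PAGE = 20

# ASSUMPTION: the "/<page number>" suffix is a hypothetical pagination
# pattern, not one confirmed for this website.
page_urls = [f"{SECTION_URL}/{page}" for page in range(1, PAGES + 1)]

# Number of article links we expect to collect across all pages.
expected_links = PAGES * LINKS_PER_PAGE
print(len(page_urls), expected_links)
```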

Now we will extract the data from each link: the news headline, the date of publication and the content of the news.

Now we have three lists, as shown above. We will convert these lists into a pandas dataframe and then do the data processing and analysis.

Converting into pandas dataframe
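A minimal sketch of the conversion, assuming the three lists are named dates, titles and contents (hypothetical names) and using made-up sample values in place of the scraped data.

```python
import pandas as pd

# ASSUMPTION: hypothetical sample values standing in for the scraped lists.
dates = ["2020-06-01 20:15", "2020-06-02 08:30"]
titles = ["Finland opens borders", "Police investigate fire in Helsinki"]
contents = [
    "The government announced new travel rules on Monday.",
    "Officers were called to the scene late on Tuesday evening.",
]

# Each list becomes one column; the lists must have equal length.
df = pd.DataFrame({"Dates": dates, "Titles": titles, "Content": contents})
print(df.shape)
```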

We will also count the number of words in 'Titles' and 'Content' and add the counts to our dataframe, as shown below.

The above dataframe shows the dates, the titles and their contents. It also shows how many words each title has and how many words each article's content has.
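One possible way to compute the counts is to split each text on whitespace and take the length of the result. The new column names 'Title_words' and 'Content_words' and the sample rows are assumptions made for this sketch.

```python
import pandas as pd

# ASSUMPTION: hypothetical sample rows standing in for the scraped data.
df = pd.DataFrame(
    {
        "Titles": ["Finland opens borders", "Police investigate fire in Helsinki"],
        "Content": [
            "The government announced new travel rules on Monday.",
            "Officers were called to the scene late on Tuesday evening.",
        ],
    }
)

# str.split() turns each text into a list of words; str.len() counts them.
df["Title_words"] = df["Titles"].str.split().str.len()
df["Content_words"] = df["Content"].str.split().str.len()
print(df[["Title_words", "Content_words"]])
```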

Saving a file to local machine

We will save the dataframe to our local machine so that we don't have to run all the above code again and again.

Reading a pickle file

Once a dataframe has been saved to a pickle file on the local computer, we can read it back using pd.read_pickle().
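The save and read steps can be sketched as a round trip. The file name news.pkl is hypothetical, and a temporary directory is used here only so that the example cleans up after itself.

```python
import os
import tempfile

import pandas as pd

# ASSUMPTION: a small hypothetical dataframe in place of the scraped one.
df = pd.DataFrame(
    {"Titles": ["Finland opens borders"], "Content": ["New travel rules apply."]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "news.pkl")  # hypothetical file name
    df.to_pickle(path)                # save the dataframe to disk
    restored = pd.read_pickle(path)   # load it back without re-scraping

print(restored.equals(df))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.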

Data Cleaning

From the information we can see that there are 600 rows of data. First we convert the 'Dates' column to datetime format. We will convert all the text in 'Titles' and 'Content' to lowercase, so that it is easy to count the words. We will also remove unnecessary words from the text, as defined in the stop words below.
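A minimal sketch of the first two cleaning steps. The date strings use a hypothetical format, since the format on the site is not shown here; pd.to_datetime may need an explicit format argument for the real data.

```python
import pandas as pd

# ASSUMPTION: hypothetical rows; the real date format may differ.
df = pd.DataFrame(
    {
        "Dates": ["2020-06-01 20:15", "2019-09-14 08:30"],
        "Titles": ["Finland Opens Borders", "Police Investigate Fire"],
        "Content": ["New Travel Rules Apply.", "Officers Were Called."],
    }
)

# Parse the date strings into datetime values.
df["Dates"] = pd.to_datetime(df["Dates"])

# Lowercase the text so that 'Finland' and 'finland' count as one word.
df["Titles"] = df["Titles"].str.lower()
df["Content"] = df["Content"].str.lower()
print(df.dtypes)
```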

Stop words are commonly used words like 'for', 'to', 'on' etc. We will be removing all these stop words from the titles and contents.

The stop_words list does not include some of the words that we want to exclude in this project. For instance, we would like to remove words like 'also' and 'would', which are not in stop_words. Let's add them to the list.

We have 182 stop words, which we will remove from our 'Titles' and 'Content' columns, as we are not interested in them.
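The idea can be sketched with a tiny hand-written set standing in for the full 182-word list. The helper name remove_stop_words and the sample sentence are hypothetical.

```python
# ASSUMPTION: a small illustrative set, not the project's full stop-word list.
stop_words = {"for", "to", "on", "the", "in", "a", "of"}

# Extend the set with project-specific words that the base list lacks.
stop_words.update({"also", "would"})


def remove_stop_words(text, stop_words):
    """Return the lowercase words of text that are not stop words."""
    return [word for word in text.lower().split() if word not in stop_words]


cleaned = remove_stop_words(
    "Police would also search for the suspect in Helsinki", stop_words
)
print(cleaned)
```

Using a set rather than a list keeps each membership check fast, which matters when filtering 600 articles.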

Data analysis

Average, Maximum and Minimum words in News Articles

Average, Maximum and Minimum words in News Title

News post by hour

The above figure shows that the largest number of news articles is published at 8 pm. No news is published from 1 am to 4 pm.

News post by Month

From the above figure, we see that June 2020 has the highest number of news articles (44). Only 1 article was published in September 2019.
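The hourly and monthly counts behind such figures can be sketched with Counter from the standard library; the project itself may have used pandas grouping instead. The timestamps below are hypothetical and do not reproduce the real distribution.

```python
from collections import Counter
from datetime import datetime

# ASSUMPTION: hypothetical publication timestamps.
published = [
    datetime(2020, 6, 1, 20, 15),
    datetime(2020, 6, 2, 20, 40),
    datetime(2020, 6, 3, 8, 5),
    datetime(2019, 9, 14, 17, 30),
]

# Count articles by hour of day (0-23) and by year-month.
posts_per_hour = Counter(ts.hour for ts in published)
posts_per_month = Counter(ts.strftime("%Y-%m") for ts in published)

busiest_hour, hour_count = posts_per_hour.most_common(1)[0]
busiest_month, month_count = posts_per_month.most_common(1)[0]
print(busiest_hour, busiest_month)
```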

Counting number of unique words in titles

We will count the number of unique words in news titles and news content.

Wordcloud plot (news titles)

Top 10 common words in news headlines

We see that in National news, the most used word in titles is 'Finland' and the second most used word is 'police'.
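A ranking of this kind can be produced with Counter.most_common. The titles below are hypothetical, chosen only so that the sketch has a clear first and second place.

```python
from collections import Counter

# ASSUMPTION: hypothetical, already lowercased and cleaned titles.
titles = [
    "finland reports new cases",
    "police investigate helsinki fire",
    "finland police warn drivers",
    "finland extends border controls",
]

# Flatten all titles into one list of words, then count each word.
words = [word for title in titles for word in title.split()]
word_counts = Counter(words)

top_words = word_counts.most_common(10)  # list of (word, count) pairs
unique_words = len(word_counts)          # number of distinct words
print(top_words[:2], unique_words)
```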

Counting number of unique words in Content

As we can see in 'new_content', some of our words carry unnecessary characters. We have words like 'affairs.' with punctuation attached. Hence, we need to strip full stops and commas from our words. For this we will use regex.

Let us remove the stop words as we did above and count the frequencies using Counter.
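A minimal sketch of the punctuation removal, assuming a pattern that drops every character that is neither a word character nor whitespace. The exact pattern used in the project is not shown, and the helper name strip_punctuation is hypothetical.

```python
import re


def strip_punctuation(text):
    """Remove every character that is not a letter, digit, underscore or whitespace."""
    return re.sub(r"[^\w\s]", "", text)


cleaned = strip_punctuation("ministry of foreign affairs. police, however, disagreed.")
print(cleaned.split())
```

Note that this pattern also removes hyphens and apostrophes, so a word like "well-known" becomes "wellknown"; a narrower pattern such as `[.,]` would strip only full stops and commas.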

Wordcloud plot (News Contents)

Top 10 used words